Ship 8 Tranche 5a: high-bit 4:2:0 RGBA u8 SIMD #25
Pull request overview
This PR adds SIMD-backed support for high-bit-depth 4:2:0 RGBA (u8 output) row conversion paths across multiple architectures and wires them into the public row dispatch layer.
Changes:
- Add `use_simd`-controlled dispatcher routing for high-bit 4:2:0 RGBA u8 conversions (YUV420p 9/10/12/14, P010/P012, YUV420p16, P016).
- Implement RGBA SIMD entrypoints by reusing existing RGB kernels via shared `*_to_rgb_or_rgba_row` implementations with an `ALPHA` const parameter.
- Add scalar↔SIMD byte-equivalence tests for the new RGBA SIMD paths across backends.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/row/mod.rs | Wires high-bit 4:2:0 RGBA u8 dispatchers to per-arch SIMD implementations with scalar fallback. |
| src/row/arch/x86_sse41.rs | Adds SSE4.1 RGBA wrappers/shared kernels (ALPHA=true) for high-bit 4:2:0 and P010/P012/P016 families. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 scalar equivalence tests for the new RGBA high-bit 4:2:0 paths. |
| src/row/arch/x86_avx2.rs | Adds AVX2 RGBA wrappers/shared kernels (ALPHA=true) for high-bit 4:2:0 and P010/P012/P016 families. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 scalar equivalence tests for the new RGBA high-bit 4:2:0 paths. |
| src/row/arch/x86_avx512.rs | Adds AVX-512 RGBA wrappers/shared kernels (ALPHA=true) for high-bit 4:2:0 and P010/P012/P016 families. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512 scalar equivalence tests for the new RGBA high-bit 4:2:0 paths. |
| src/row/arch/wasm_simd128.rs | Adds wasm simd128 RGBA wrappers/shared kernels (ALPHA=true) for high-bit 4:2:0 and P010/P012/P016 families. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 scalar equivalence tests for the new RGBA high-bit 4:2:0 paths. |
| src/row/arch/neon.rs | Adds NEON RGBA wrappers/shared kernels (ALPHA=true) for high-bit 4:2:0 and P010/P012/P016 families. |
| src/row/arch/neon/tests.rs | Adds NEON scalar equivalence tests for the new RGBA high-bit 4:2:0 paths. |
    fn check_planar_u8_sse41_rgba_equivalence_n<const BITS: u32>(
        width: usize,
        matrix: ColorMatrix,
        full_range: bool,
    ) {
The newly added RGBA equivalence helpers and tests call SSE4.1 intrinsics unconditionally (via `unsafe { yuv_420p_n_to_rgba_row... }`, etc.). Unlike the existing SSE4.1 tests earlier in this file, these helpers don't gate execution on `std::arch::is_x86_feature_detected!("sse4.1")`, which can cause SIGILL on CPUs without SSE4.1 (and may also break under Miri, where the existing tests rely on that detection check to return early). Add the same feature-detection guard, either in each helper or at the start of each `#[test]`.
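A minimal sketch of the guard pattern this comment asks for. The names `double_row_sse41`, `double_row_scalar`, and `check_sse41_equivalence` are hypothetical stand-ins, not functions from this PR, and the "SIMD" body is a plain loop compiled with SSE4.1 enabled rather than a real kernel. The point is only the early return before any `unsafe` call.

```rust
/// Hypothetical scalar reference path.
fn double_row_scalar(src: &[u8], dst: &mut [u8]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = s.wrapping_mul(2);
    }
}

/// Hypothetical stand-in for a kernel compiled with SSE4.1 enabled.
/// Calling it on a CPU without SSE4.1 is undefined behavior.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sse4.1")]
unsafe fn double_row_sse41(src: &[u8], dst: &mut [u8]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = s.wrapping_mul(2);
    }
}

/// Returns `false` when the check was skipped, `true` when it ran and matched.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn check_sse41_equivalence(width: usize) -> bool {
    // Guard first: skip the whole check on hosts without SSE4.1.
    if !std::arch::is_x86_feature_detected!("sse4.1") {
        return false;
    }
    let src: Vec<u8> = (0..width).map(|i| i as u8).collect();
    let mut simd = vec![0u8; width];
    let mut scalar = vec![0u8; width];
    // SAFETY: SSE4.1 availability was verified just above.
    unsafe { double_row_sse41(&src, &mut simd) };
    double_row_scalar(&src, &mut scalar);
    assert_eq!(simd, scalar, "SIMD and scalar rows diverge");
    true
}

/// On non-x86 targets there is nothing to compare, so the check is skipped.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn check_sse41_equivalence(_width: usize) -> bool {
    false
}
```

Placing the guard inside the helper rather than in each `#[test]` means a newly added test cannot forget it, which is one reason to prefer that option.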
        width: usize,
        matrix: ColorMatrix,
        full_range: bool,
    ) {
The new AVX2 RGBA equivalence helpers and tests invoke AVX2 intrinsics unconditionally. Existing AVX2 tests in this file return early when `!std::arch::is_x86_feature_detected!("avx2")`; these new ones should do the same to avoid illegal-instruction crashes on non-AVX2 hosts (and to match the established test pattern in this module).
    ) {
        if !std::arch::is_x86_feature_detected!("avx2") {
            return;
        }
        width: usize,
        matrix: ColorMatrix,
        full_range: bool,
    ) {
The new AVX-512 RGBA equivalence helpers and tests call AVX-512BW intrinsics unconditionally. Other AVX-512 tests in this file are guarded by `std::arch::is_x86_feature_detected!("avx512bw")`; without the same guard these tests can SIGILL on CPUs lacking AVX-512BW. Add the feature-detection early return (in helpers or per test), consistent with the rest of the file.
    ) {
        if !std::arch::is_x86_feature_detected!("avx512bw") {
            return;
        }
Summary
Adds u8 RGBA SIMD across all 5 backends for high-bit 4:2:0 YUV (`yuv420p9/10/12/14/16`, `p010/p012/p016`) and wires them into the 8 high-bit u8 RGBA dispatchers in `src/row/mod.rs`. Builds on the scalar prep + dispatcher signatures landed in PR #24. The companion u16 RGBA SIMD work is deferred to Tranche 5b.
Changes
5 SIMD backends (NEON, SSE4.1, AVX2, AVX-512, wasm simd128) each gain a const-generic `*_to_rgb_or_rgba_row<BITS, ALPHA>` template across 4 kernel families.
Existing RGB and new RGBA wrappers are thin shims over the shared template. Only the store (`vst3q_u8` vs `vst4q_u8`, `write_rgb_` vs `write_rgba_`) and the scalar tail dispatch branch on `ALPHA`; per-pixel math is unchanged.
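The shim-over-template shape can be illustrated with a scalar toy. This is a sketch under stated assumptions, not the PR's kernel: the `gray_*` names are hypothetical, the per-pixel math is a trivial grayscale copy instead of a YUV conversion, and the opaque alpha value `0xFF` is an assumption.

```rust
/// Hypothetical shared template: writes 3 bytes per pixel when `ALPHA` is
/// false and 4 bytes per pixel (with an assumed opaque alpha) when it is true.
fn gray_to_rgb_or_rgba_row<const ALPHA: bool>(luma: &[u8], out: &mut [u8]) {
    let bpp = if ALPHA { 4 } else { 3 };
    assert!(out.len() >= luma.len() * bpp, "output row too short");
    for (px, &y) in out.chunks_exact_mut(bpp).zip(luma) {
        // Per-pixel math is identical for both layouts.
        px[0] = y;
        px[1] = y;
        px[2] = y;
        // Only the store differs; `ALPHA` is a constant, so the compiler
        // can drop this branch in the RGB instantiation.
        if ALPHA {
            px[3] = 0xFF;
        }
    }
}

/// Thin RGB shim over the shared template.
pub fn gray_to_rgb_row(luma: &[u8], out: &mut [u8]) {
    gray_to_rgb_or_rgba_row::<false>(luma, out);
}

/// Thin RGBA shim over the shared template.
pub fn gray_to_rgba_row(luma: &[u8], out: &mut [u8]) {
    gray_to_rgb_or_rgba_row::<true>(luma, out);
}
```

Because both shims instantiate the same body, a fix to the shared math reaches the RGB and RGBA paths at once.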
8 high-bit u8 RGBA dispatchers wired in `src/row/mod.rs` (`yuv420p9/10/12/14/16_to_rgba_row`, `p010/p012/p016_to_rgba_row`) — replace the prior `let _ = use_simd` stubs with the standard `cfg_select!` per-arch route block, mirroring the existing RGB dispatchers. `use_simd = false` still forces scalar.
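A rough sketch of what such a dispatcher looks like, assuming a single AVX2 route for brevity. It uses plain `#[cfg]` attributes instead of the `cfg_select!` block the PR describes, and the `invert_row*` names and their trivial bodies are hypothetical stand-ins for the real row kernels.

```rust
/// Hypothetical scalar fallback.
fn invert_row_scalar(src: &[u8], dst: &mut [u8]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = !s;
    }
}

/// Hypothetical AVX2 route; a plain loop compiled with AVX2 enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn invert_row_avx2(src: &[u8], dst: &mut [u8]) {
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = !s;
    }
}

/// Public dispatcher: `use_simd = false` always takes the scalar path.
pub fn invert_row(src: &[u8], dst: &mut [u8], use_simd: bool) {
    assert_eq!(src.len(), dst.len(), "row length mismatch");
    if use_simd {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("avx2") {
                // SAFETY: AVX2 availability was verified at runtime above.
                unsafe { invert_row_avx2(src, dst) };
                return;
            }
        }
    }
    // Reached when SIMD is disabled, unsupported, or not compiled in.
    invert_row_scalar(src, dst);
}
```

Keeping the scalar call as the single fall-through makes the `use_simd = false` guarantee easy to audit, and equivalence tests can compare the two settings directly.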
Per-backend RGBA equivalence tests — ~30 new `#[test]` functions across the 5 backend test modules. Each new x86 test gates on `is_x86_feature_detected!` so the suite stays clean under sanitizer/Miri/non-feature-flagged CI runners.
Compile-time `const { assert!(BITS == ...) }` retained on every shared template (was already a Codex-flagged hardening from prior tranches).
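For readers unfamiliar with the idiom, here is a small sketch of a compile-time bit-depth check using an inline `const` block (stable since Rust 1.79). The function `max_sample` and the accepted set of depths are assumptions chosen for illustration; the PR's actual assertion conditions are elided above and are not reproduced here.

```rust
/// Hypothetical helper: the largest sample value for a given bit depth.
fn max_sample<const BITS: u32>() -> u16 {
    // Evaluated when the function is instantiated. An unsupported depth such
    // as `max_sample::<11>()` fails the build instead of panicking at
    // runtime. The accepted set here is an assumption for this sketch.
    const { assert!(BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14) };
    ((1u32 << BITS) - 1) as u16
}
```

With this in place, `max_sample::<10>()` evaluates to 1023 and `max_sample::<12>()` to 4095.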
Test plan
🤖 Generated with Claude Code